Song Analysis Project

Author

Ryan DeStefano and Sean Supp

Introduction

Based on what we discussed in our initial consulting meeting, our understanding is that you have two goals for this data analysis. The first and primary goal is to recommend songs for users of your music software to listen to next, based on the characteristics (danceability, acousticness, etc.) of the song they are currently listening to: essentially a song recommendation algorithm built from previously listened-to songs. The secondary goal is to predict the genre of a song from the same set of song characteristics.

It is our understanding that the desired outcome of this project is to improve the ratings and user experience of your music software. The improved recommendation algorithm will help achieve this goal and will be delivered as a front-end application (a Shiny app) along with a supplementary report outlining the steps of the analysis. The target audience is therefore all users of your software. The Shiny app will allow users to select a song and will then return a list of songs ordered by similarity to the selected song.

The data you have provided come in the form of two separate datasets, one for the primary goal and one for the secondary goal.

The first dataset contains roughly 130,000 songs available on your music software. Each row of the dataset is a specific song accompanied by a variety of variables (acousticness, danceability, energy, etc.); a subset of these variables will be used to cluster the songs into groups, and the song recommendations will be based on these clusters. For example, if a user of the app inputs a song, the list of recommended songs will be a subset of the songs in that song's cluster.

Here is the link to the data described above: https://www.kaggle.com/datasets/tomigelo/spotify-audio-features?select=SpotifyAudioFeaturesApril2019.csv
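The recommendation idea (cluster the songs once, then suggest other songs from the input song's cluster) can be sketched in a few lines. This is a toy illustration with simulated features and made-up song names, not the project's actual pipeline:

```r
# Toy sketch of the cluster-based recommendation (simulated data; the real
# analysis clusters the full audio-feature matrix built later in this report).
set.seed(1)
songs_toy <- data.frame(
  track_name   = paste0("song_", 1:100),
  danceability = runif(100),
  acousticness = runif(100)
)

# Step 1: cluster the songs on their audio features
km <- kmeans(songs_toy[, c("danceability", "acousticness")], centers = 4)
songs_toy$cluster <- km$cluster

# Step 2: recommend other songs from the same cluster as the input song
recommend <- function(track, data, n = 5) {
  cl <- data$cluster[data$track_name == track]
  candidates <- data[data$cluster == cl & data$track_name != track, ]
  head(candidates$track_name, n)
}

recommend("song_1", songs_toy)
```

The real app works the same way, but on the full set of audio features and with the recommendations ordered by similarity within the cluster.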

The second dataset is a different subset of songs from your music software, this time about 114,000 songs. The difference from the first dataset is that this one includes the track genre as a column; the other columns contain exactly the same song characteristics as before. This dataset will be used to train a model that predicts the genre of a song from its characteristics.

Here is the link to the data described above: https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset/data
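To make the genre-prediction goal concrete, here is a minimal sketch of a classifier built on audio features. Both the data and the nearest-centroid rule are illustrative stand-ins; the model ultimately chosen for the project may differ:

```r
# Minimal illustration of predicting genre from audio features with a
# nearest-centroid rule (simulated data; the project's final model may differ).
set.seed(2)
train <- data.frame(
  energy = c(rnorm(50, 0.8, 0.1), rnorm(50, 0.3, 0.1)),
  tempo  = c(rnorm(50, 140, 10),  rnorm(50, 90, 10)),
  genre  = rep(c("edm", "acoustic"), each = 50)
)

# Standardize the features, then compute one centroid per genre
X <- scale(train[, c("energy", "tempo")])
centroids <- aggregate(as.data.frame(X), by = list(genre = train$genre), FUN = mean)

predict_genre <- function(energy, tempo) {
  # Scale the new song using the training means and standard deviations
  z <- (c(energy, tempo) - attr(X, "scaled:center")) / attr(X, "scaled:scale")
  # Assign the genre whose centroid is closest in feature space
  d <- apply(centroids[, -1], 1, function(ctr) sum((z - ctr)^2))
  centroids$genre[which.min(d)]
}

predict_genre(energy = 0.85, tempo = 145)
```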

Regarding the ethics of this analysis, one major issue is the scope of the dataset used for the recommendation algorithm. Not all artists are contained in the dataset, so artists who are absent can never be recommended to users. Depending on how the songs in this dataset were chosen, certain artists could be systematically excluded and thus never given a fair shot. Smaller artists are likely to have fewer songs in the dataset, giving them a lower chance of being recommended, while more popular artists will likely have more songs and be recommended more often. In its current state, the algorithm may therefore reinforce the status quo of the music industry: keeping popular artists popular and making it harder for less popular artists to break through.

Previous Work

Articles:

  • https://medium.com/neemz-product/the-inner-workings-of-spotifys-ai-powered-music-recommendations-how-spotify-shapes-your-playlist-a10a9148ee8d
  • https://www.hopkinsmedicine.org/health/wellness-and-prevention/keep-your-brain-young-with-music
  • https://www.northshore.org/healthy-you/9-health-benefits-of-music/#:~:text=Music%20can%20boost%20the%20brain’s,involved%20in%20mood%20and%20emotions.

Summary of the first article:

Spotify’s rise to the top of the music streaming world is largely due to its ingenious use of AI to create personalized music experiences for its massive user base of over 500 million monthly listeners. The platform’s success hinges on its ability to understand user preferences and behaviors through data analysis and collaborative filtering. This process connects users with music that aligns with their tastes, creating playlists that feel custom-made. Spotify dives deep into the nuances of each track, considering aspects like beats, lyrics, and cultural context, to make finely-tuned recommendations.

As Spotify continues to evolve, it faces challenges like the cold start problem with new artists and potential biases in recommendations. To address these, it incorporates human expertise alongside its algorithms and continuously adjusts its system to ensure fairness and diversity. Looking to the future, Spotify is exploring innovative concepts like a virtual DJ persona, enhancing user interaction and making music discovery more engaging and personal. This blend of advanced technology and human insights sets Spotify apart in curating unique and resonant musical journeys for each user.

Summary of the second article:

This article explores how our brains process and perceive music. When we listen to music, sound vibrations travel through the air, enter the ear canal, and vibrate the eardrum. These vibrations are then converted into electrical signals that travel along the auditory nerve to the brain stem. Here, these signals are reconstructed into what we recognize as music.

The article also highlights research conducted by Johns Hopkins University, where jazz musicians and rappers were asked to improvise music while inside an fMRI machine. This study aimed to observe which areas of the brain are activated during musical improvisation.

The article emphasizes that music is not just an art form but also possesses structural, mathematical, and architectural qualities. It relies on the relationship between successive notes. Unbeknownst to many, the brain performs complex computations to interpret and make sense of music, illustrating the intricate connection between music and cognitive processes.

Summary of the third article:

The article highlights the numerous health benefits of music, emphasizing its ability to positively impact mood, reduce pain and anxiety, and provide a means for emotional expression. It delves into the role of music therapy in healthcare, particularly in hospice and palliative care. The therapy, conducted by board-certified music therapists, is used to complement traditional treatments for various conditions, including anxiety, depression, stress, pain management, and improving function in degenerative neurological disorders. The article underscores the significant role that music can play in enhancing both physical and mental health.

Similarities and Differences Between Our Work and the Articles:

Our project is most similar to the first article: we are also building a machine-learning algorithm that makes music more engaging for users, in our case by recommending songs and playlists based on what a user has previously listened to (i.e., tailoring recommendations to each user's taste). We will also deploy an interactive app (a Shiny app) that displays the recommended songs to play next.

From the second and third articles, we found that music has the ability to positively impact mood, reduce pain and anxiety, and provide a means for emotional expression.

Since the latter two articles are not "previous work" in the same sense, the only substantive difference between our project and the first article is that we also predict the genre of songs from a variety of measurements.

Exploratory Analysis

Variables of Interest:

Acousticness: confidence measure on if the song is acoustic

Danceability: measure of how well suited the song is for dancing

Energy: measure of intensity and activity

Instrumentalness: measure of the level to which the song is an instrumental track

Key: the musical key of the track

Liveness: measure of the likelihood an audience was present during the song's recording

Loudness: measure of loudness in decibels

Mode: the modality (major or minor, 1 or 0)

Speechiness: measure of the quantity of spoken words

Tempo: beats per minute

Time Signature: estimated time signature

Valence: measure of positivity

Track Genre: song genre

Primary Goal Clustering Data:

Code
audio <- read_csv(here::here("Project", "Clustering Data.csv"))

audio <- audio %>% 
  mutate(
    across(.cols = c(key, mode, time_signature), 
           .fns = ~ as.factor(.x)
           )
  )
Code
audio %>% 
  select(-artist_name, -track_name)%>%
  summary()
   track_id          acousticness     danceability     duration_ms     
 Length:130663      Min.   :0.0000   Min.   :0.0000   Min.   :   3203  
 Class :character   1st Qu.:0.0316   1st Qu.:0.4590   1st Qu.: 163922  
 Mode  :character   Median :0.2030   Median :0.6050   Median : 201901  
                    Mean   :0.3425   Mean   :0.5815   Mean   : 212633  
                    3rd Qu.:0.6360   3rd Qu.:0.7270   3rd Qu.: 241048  
                    Max.   :0.9960   Max.   :0.9960   Max.   :5610020  
                                                                       
     energy       instrumentalness        key           liveness     
 Min.   :0.0000   Min.   :0.000000   1      :15348   Min.   :0.0000  
 1st Qu.:0.3960   1st Qu.:0.000000   0      :14972   1st Qu.:0.0975  
 Median :0.6030   Median :0.000149   7      :13930   Median :0.1240  
 Mean   :0.5692   Mean   :0.224018   2      :12363   Mean   :0.1949  
 3rd Qu.:0.7750   3rd Qu.:0.440000   9      :12029   3rd Qu.:0.2360  
 Max.   :1.0000   Max.   :1.000000   5      :10816   Max.   :0.9990  
                                     (Other):51205                   
    loudness       mode       speechiness         tempo        time_signature
 Min.   :-60.000   0:51254   Min.   :0.0000   Min.   :  0.00   0:   299      
 1st Qu.:-11.898   1:79409   1st Qu.:0.0389   1st Qu.: 96.01   1:  1749      
 Median : -7.979             Median :0.0559   Median :120.03   3: 12666      
 Mean   : -9.974             Mean   :0.1120   Mean   :119.47   4:112652      
 3rd Qu.: -5.684             3rd Qu.:0.1290   3rd Qu.:139.64   5:  3297      
 Max.   :  1.806             Max.   :0.9660   Max.   :249.98                 
                                                                             
    valence         popularity    
 Min.   :0.0000   Min.   :  0.00  
 1st Qu.:0.2240   1st Qu.:  7.00  
 Median :0.4200   Median : 22.00  
 Mean   :0.4396   Mean   : 24.21  
 3rd Qu.:0.6380   3rd Qu.: 38.00  
 Max.   :1.0000   Max.   :100.00  
                                  

Comment: Above is a summary of the distribution of the numerical variables in this dataset. Key, mode, and time_signature are ordinal variables; for those, the output displays the counts associated with each level.

Code
audio %>% 
  select(artist_name,track_name) %>%
  head()
# A tibble: 6 × 2
  artist_name track_name                                    
  <chr>       <chr>                                         
1 YG          Big Bank feat. 2 Chainz, Big Sean, Nicki Minaj
2 YG          BAND DRUM (feat. A$AP Rocky)                  
3 R3HAB       Radio Silence                                 
4 Chris Cooq  Lactose                                       
5 Chris Cooq  Same - Original mix                           
6 Curbo       Debauchery - Original mix                     

Comment: Above is a quick look at how artist_name and track_name appear in this dataset. The dataset includes multiple songs from the same artist rather than one song per artist. There are too many artists in this data for a per-artist song count to be readable, so we omit a table of the number of songs for each artist.

Code
audio %>%
  ggplot(mapping = aes(x = danceability)) +
  geom_histogram(bins=40,fill = "blue", color = "black") +
  labs(x="danceability of a song", y="Count")+
  ggtitle("Danceability (a mixture of song features such as beat and tempo)")

Figure 1: Danceability Distribution

The distribution looks roughly symmetric with a slight left skew. Across all songs, most have fairly high danceability, around 0.5-0.7. Given that we are going to perform unsupervised clustering for this part, it might not be beneficial to include danceability in our models, since most songs have roughly the same value.

Code
audio %>%
  ggplot(mapping = aes(x = acousticness)) +
  geom_histogram(bins=40,fill = "pink", color = "black") +
  labs(x="acousticness of a song", y="Count")+
  ggtitle("Acousticness: confidence from 0.0 to 1.0 that the track is acoustic")

Figure 2: Acousticness Distribution

The distribution is fairly even across the middle range (0.25-0.75), with peaks toward both ends. This suggests that similar songs are likely to share similar acousticness values; songs are spread out rather than concentrated in one narrow range. Therefore, we should include acousticness in our model.

Code
audio %>%
  ggplot(mapping = aes(x = tempo)) +
  geom_histogram(bins=40,fill = "salmon", color = "black") +
  labs(x="tempo of a song (BPM)", y="Count")+
  ggtitle("Beats Per Minute Distribution ")

Figure 3: Tempo Distribution

Most songs fall around 90-150 BPM, with peaks near the ends of this range. Tempo might therefore not be very useful for clustering, since most songs sit in roughly the same tempo range.

Secondary Goal Genre Prediction Data:

Code
genre <- read_csv(here::here("Project", "modified_dataset.csv"))
Code
genre %>% 
  select(-track_id, -artists, -album_name, -track_name, -track_genre, -`Unnamed: 0`) %>%
  summary()
   popularity      duration_ms       explicit        danceability   
 Min.   :  0.00   Min.   :      0   Mode :logical   Min.   :0.0000  
 1st Qu.: 17.00   1st Qu.: 174066   FALSE:104253    1st Qu.:0.4560  
 Median : 35.00   Median : 212906   TRUE :9747      Median :0.5800  
 Mean   : 33.24   Mean   : 228029                   Mean   :0.5668  
 3rd Qu.: 50.00   3rd Qu.: 261506                   3rd Qu.:0.6950  
 Max.   :100.00   Max.   :5237295                   Max.   :0.9850  
     energy            key            loudness            mode       
 Min.   :0.0000   Min.   : 0.000   Min.   :-49.531   Min.   :0.0000  
 1st Qu.:0.4720   1st Qu.: 2.000   1st Qu.:-10.013   1st Qu.:0.0000  
 Median :0.6850   Median : 5.000   Median : -7.004   Median :1.0000  
 Mean   :0.6414   Mean   : 5.309   Mean   : -8.259   Mean   :0.6376  
 3rd Qu.:0.8540   3rd Qu.: 8.000   3rd Qu.: -5.003   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :11.000   Max.   :  4.532   Max.   :1.0000  
  speechiness       acousticness    instrumentalness      liveness     
 Min.   :0.00000   Min.   :0.0000   Min.   :0.00e+00   Min.   :0.0000  
 1st Qu.:0.03590   1st Qu.:0.0169   1st Qu.:0.00e+00   1st Qu.:0.0980  
 Median :0.04890   Median :0.1690   Median :4.16e-05   Median :0.1320  
 Mean   :0.08465   Mean   :0.3149   Mean   :1.56e-01   Mean   :0.2136  
 3rd Qu.:0.08450   3rd Qu.:0.5980   3rd Qu.:4.90e-02   3rd Qu.:0.2730  
 Max.   :0.96500   Max.   :0.9960   Max.   :1.00e+00   Max.   :1.0000  
    valence           tempo        time_signature 
 Min.   :0.0000   Min.   :  0.00   Min.   :0.000  
 1st Qu.:0.2600   1st Qu.: 99.22   1st Qu.:4.000  
 Median :0.4640   Median :122.02   Median :4.000  
 Mean   :0.4741   Mean   :122.15   Mean   :3.904  
 3rd Qu.:0.6830   3rd Qu.:140.07   3rd Qu.:4.000  
 Max.   :0.9950   Max.   :243.37   Max.   :5.000  

Above is a summary describing the distribution of the numerical variables in the classification dataset.

Code
sample_genre <- genre %>% sample_n(500) %>% as.data.frame() 
Code
sample_genre %>% ggplot(aes(x = loudness,y=danceability,color=track_genre)) +
  geom_point()+
  labs(x="Loudness of a song (DCB)",
       y="Danceability of a song")

Figure 4: Loudness vs. Danceability scatterplot

We do not see a clear, distinct difference between genres. This suggests that these two variables (loudness and danceability) may not contribute much to separating the genre classes.

Code
sample_genre %>%
  ggplot(mapping = aes(x = loudness,color=track_genre)) +
  geom_histogram(bins=40) +
  labs(x="Loudness of a song  (DCB)", y="Count")+
  ggtitle("Loudness vs. Count")

Figure 5: Loudness Distribution

The distribution of the loudness variable is left-skewed: most values are relatively high, with a tail of lower values pulling the mean to the left.

Code
sample_genre %>%
  ggplot(mapping = aes(x = danceability, color=track_genre)) +
  geom_histogram(bins=40) +
  labs(x="danceability of a song", y="Count")+
  ggtitle("Danceability vs. Count")

Figure 6: Danceability distribution

To us, this looks like a multimodal distribution. This suggests that the dataset contains different genres and each genre has its own relative danceability distribution.

Code
sample_genre %>%
  ggplot(mapping = aes(x = time_signature)) +
  geom_histogram(bins=20,fill = "salmon", color = "black") +
  labs(x="Time signature of a song", y="Count")+
  ggtitle("Time Signature Distribution")

Figure 7:

Although time signature is stored as a number, we should treat it as categorical: it is not a continuous numerical variable.

Code
sample_genre %>%
  ggplot(mapping = aes(x = tempo)) +
  geom_histogram(bins=40,fill = "salmon", color = "black") +
  labs(x="tempo of a song", y="Count")+
  ggtitle("Tempo Distribution")

Figure 8: Tempo Distribution

There are outliers in the tempo variable (e.g., 0 and values above 200). Most songs lie between 100 and 150 BPM. This suggests that tempo might not be a good candidate for the analysis, since most songs are concentrated in a narrow range rather than spread out.

Analysis

Primary Goal (Clustering):

We have determined that the features track_id, duration_ms, and popularity will not be used for clustering because, contextually, they don't make sense to use. track_id has a unique value for each song and thus cannot help group songs. duration_ms represents the length of the song; length shouldn't be a determining factor in the clusters because it has nothing to do with how the song sounds, and if a user likes the sound of a song, its length will likely not decide whether they like it. popularity, a numerical value based on the song's number of plays on Spotify, was dropped because popularity has nothing to do with the actual makeup of the song. These variables were removed from the dataset, and the categorical variables were converted to one-hot-encoded dummy variables.

Code
audio <- audio %>%
  dplyr::select(-track_id, -duration_ms, -popularity)

categorical_vars <- c("mode", "key", "time_signature")

# One-hot encode the categorical variables (caret::dummyVars)
dummy_data <- dummyVars("~.", data = audio[categorical_vars], fullRank = TRUE) %>%
  predict(audio[categorical_vars])

audio_encoded <- cbind(audio %>% select(-all_of(categorical_vars)), dummy_data)
Code
audio_matrix <- audio_encoded %>%
  select(-artist_name, -track_name) %>%
  as.matrix()

artists <- audio %>% 
  pull(artist_name)

songs <- audio %>% 
  pull(track_name)

pc <- prcomp(audio_matrix, 
             center = TRUE, 
             scale = TRUE)

new_dims_df <- pc$x %>%
  as.data.frame() %>% 
  mutate(Artist = artists, Song = songs)

We performed a principal component analysis as a data preprocessing step. The purpose is to reduce the number of features used for clustering by removing extraneous noise in the data while still retaining information from all the initial features. We will perform the clustering both on a subset of the principal components and on the original set of features, then decide which method best separates the songs into groups by comparing silhouette scores.

The silhouette score is a metric often used in clustering to help choose the number of clusters. It is essentially a grade for how clear and distinct the clusters are: a high score (closer to 1) means the groups we found are well defined and separated from each other. To decide on the best number of clusters, we clustered the data multiple times with a different number of clusters each time and examined which number gave the highest silhouette score, since that configuration yields the most well-defined, separated groups. Silhouette score was not the only criterion, though; we also weighed simplicity, since a small number of clusters is easier to understand than a large one. For example, if a small number of clusters yields a silhouette score only slightly lower than that of a much larger number of clusters, the smaller number will be chosen.
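For a single point i, the silhouette value is s(i) = (b - a) / max(a, b), where a is the mean distance from i to the other points in its own cluster and b is the mean distance from i to the points in the nearest other cluster; the reported score is the mean of s(i) over all points. A hand-computed example with toy 1-D values:

```r
# Hand-computed silhouette value for one point in a toy 1-D clustering
# (illustrative numbers only; the analysis below uses cluster::silhouette).
own   <- c(1.0, 1.2, 1.4)   # point i = 1.0 together with its cluster-mates
other <- c(5.0, 5.3, 5.6)   # the nearest other cluster

a <- mean(abs(own[1] - own[-1]))   # mean distance to own cluster: 0.3
b <- mean(abs(own[1] - other))     # mean distance to other cluster: 4.3
s <- (b - a) / max(a, b)           # about 0.93: close to 1, well separated
round(s, 3)
```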

Clustering With PCA:

Code
cumul_vars <- cumsum(pc$sdev^2)/sum(pc$sdev^2)

17 principal components account for more than 85% of the variation in the data, so we will use the first 17 PCs as the features for clustering. This reduces the feature count from 25 to 17; dropping the last 8 principal components removes extraneous noise from the data, while the first 17 PCs still retain information from all 25 original features.
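The cutoff can be read directly off the cumulative-variance vector computed above; here is a small self-contained sketch on simulated data (the 85% threshold mirrors the one used in the text):

```r
# Pick the smallest number of PCs whose cumulative share of variance
# reaches 85% (simulated data; the report applies this to `pc` above).
set.seed(3)
toy    <- matrix(rnorm(200 * 10), ncol = 10)
toy_pc <- prcomp(toy, center = TRUE, scale. = TRUE)

cumul <- cumsum(toy_pc$sdev^2) / sum(toy_pc$sdev^2)
n_pcs <- which(cumul >= 0.85)[1]   # first PC index crossing the threshold
n_pcs
```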

We will perform clustering on the 17 PCs. To determine how many clusters to use, we ran the clustering algorithm multiple times with different numbers of clusters, calculated the silhouette score for each, and chose the number that gave the highest score while also accounting for simplicity. The plot can be seen below. It is worth noting that, due to the size of the dataset (about 130,000 songs), we used a subset of 10,000 songs to test the different numbers of clusters because of the computational limits of our computers.

Code
sample_pc <- new_dims_df %>% 
  sample_n(size = 10000, replace = FALSE)

k_values <- 2:30 

silhouette_scores <- numeric(length(k_values))

for (k in k_values) {
  kmeans_model <- kmeans(sample_pc[, 1:17], centers = k, algorithm = "MacQueen", iter.max = 1000)
  cluster_assignments <- kmeans_model$cluster
  
  # Mean silhouette width across all points (cluster::silhouette; column 3 is sil_width)
  silhouette_scores[k - 1] <- mean(silhouette(cluster_assignments, dist(sample_pc[, 1:17]))[, 3])
}

scree_data <- tibble(k = k_values, silhouette = silhouette_scores)

ggplot(scree_data, aes(x = k, y = silhouette)) +
  geom_line(color="green") +
  geom_point(color="black") +
  labs(title = "Scree Plot with Silhouette Scores Based on PC Clustering",
       x = "Number of Clusters (k)",
       y = "Silhouette Score")

In the scree plot, it is apparent that 17 clusters is the optimal choice: this number of clusters gives a high silhouette score while not being a very large number of clusters. Some larger numbers of clusters have slightly higher silhouette scores, but the increases are only marginal. Thus, the simpler, smaller number of clusters with a comparably high score is the optimal choice. Now we will fit that optimal number of clusters.

Code
set.seed(27)

audio_km <- kmeans(new_dims_df[ , 1:17], centers = 17, algorithm = "MacQueen", iter.max=1000)

results_pca <- tibble(
  clust = audio_km$cluster, 
  Artist = artists,
  Song = songs)

table <- results_pca %>% 
  count(clust)

kable(table, 
      format = "html",
      col.names=c("Cluster", "Num of Songs"),
      align=c("c", "c")
      )
Cluster Num of Songs
1 19617
2 8276
3 1962
4 2553
5 10847
6 675
7 9045
8 3509
9 15376
10 6798
11 7771
12 7757
13 12748
14 7757
15 10500
16 3297
17 2175

The plot below displays the results of the clustering in the dimensions of the first two principal components. These first 2 PCs account for only around 21% of the variation in the data, so the graph will not show distinct separation between the clusters, but it still gives a rough visual of how the clusters are being formed.

Code
new_dims_df %>%
  ggplot(aes(x = PC1, y = PC2, color = as.factor(results_pca
                                                 $clust))) +
  geom_point(alpha=.2) +
  scale_color_discrete(name = "Cluster")

Clustering With Original Features:

We perform clustering on the original 25 features (including the dummy variables). To determine the optimal number of clusters, we follow the same steps as for the PC clustering: the silhouette score is calculated for each candidate number of clusters, and the number that gives the highest score while also accounting for simplicity is chosen. The plot can be seen below. As before, due to the size of the dataset (about 130,000 songs), we used a subset of 10,000 songs to test the different numbers of clusters because of the computational limits of our computers.

Code
audio_matrix_normalized <- as.data.frame(audio_matrix) %>%
  mutate_if(is.numeric, scale)

sample_full <- audio_matrix_normalized %>% 
  sample_n(size = 10000, replace = FALSE)

silhouette_scores <- numeric(length(k_values))

for (k in k_values) {
  kmeans_model <- kmeans(sample_full, centers = k, algorithm = "MacQueen", iter.max = 1000)
  cluster_assignments <- kmeans_model$cluster
  
  silhouette_scores[k - 1] <- mean(silhouette(cluster_assignments, dist(sample_full))[ ,3])
}

scree_data <- tibble(k = k_values, silhouette = silhouette_scores)

ggplot(scree_data, aes(x = k, y = silhouette)) +
  geom_line(color="green") +
  geom_point(color="black") +
  labs(title = "Scree Plot with Silhouette Scores Based on Scaled Features",
       x = "Number of Clusters (k)",
       y = "Silhouette Score")

In the scree plot, it is apparent that 13 clusters is the optimal choice: this number of clusters gives a high silhouette score while not being a very large number of clusters. Some larger numbers of clusters have slightly higher silhouette scores, but the increases are only marginal. Thus, the smaller number of clusters with a comparably high score is the optimal choice. Now we will fit that optimal number of clusters.

Code
set.seed(27)

audio_km <- kmeans(audio_matrix_normalized, centers = 13, algorithm = "MacQueen", iter.max=1000)

results <- tibble(
  clust = audio_km$cluster, 
  Artist = artists,
  Song = songs)

table <- results %>% 
  count(clust)

kable(table, 
      format = "html",
      col.names=c("Cluster", "Num of Songs"),
      align=c("c", "c")
      )
Cluster Num of Songs
1 38845
2 8947
3 2327
4 3416
5 12273
6 809
7 14471
8 3506
9 8389
10 6971
11 8547
12 8368
13 13794
Code
new_dims_df %>%
  ggplot(aes(x = PC1, y = PC2, color = as.factor(results$clust))) +
  geom_point(alpha=.3) +
  scale_color_discrete(name = "Cluster")

So, which method is better? PCA or using all features in the original state:

Examining the silhouette scores from the previous parts, the best PCA clustering has a silhouette score of around 0.3 while the best clustering on all the original features scores around 0.225, a sizable difference. The noise removal and the uncorrelated components that PCA provides also favor this method. Therefore the PCA clustering is better and will be used.

Characterizing the Clusters:

To delve into the specific characteristics of the songs within each cluster, we will look at the average value of each numerical feature within each cluster; for the categorical features, we will compare the distribution of levels within each cluster to the overall distribution of those levels in the dataset. When looking at the average values of the numerical features, we standardized them so they all fall on the same scale, essentially creating a z-score for each numerical feature within each cluster. This draws a clear picture of which features (song characteristics) make each cluster unique.

Code
for_graph <- audio %>% 
  select(acousticness, danceability, energy, instrumentalness, liveness, loudness,
         speechiness, tempo, valence)

for_graph <- for_graph %>% 
  mutate_all(~scale(.))

new_column_names <- c("acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence")
colnames(for_graph) <- new_column_names

for_graph$cluster = results_pca$clust

graph <- for_graph %>% 
  group_by(cluster) %>% 
  summarise(across(where(is.numeric), ~mean(.)))

long <- tidyr::gather(graph, key = "feature", value = "value", -cluster)
long$feature <- factor(long$feature)

The three following tables display the distribution of the categorical variables key, mode, and time signature within each cluster, the last row in each of the tables contains the overall distribution in the dataset.

Code
for_table <- audio %>% 
  select(key, mode, time_signature)

for_table$cluster = as.factor(results_pca$clust)

two_way_table <- table(for_table$cluster, for_table$key)
normalized_table <- prop.table(two_way_table, margin=1)
rounded_table <- round(normalized_table, 3)

colnames(rounded_table) <- paste("Key =", colnames(normalized_table))
rownames(rounded_table) <- paste("Cluster =", rownames(normalized_table))

overall <- for_table %>%
  group_by(key) %>%
  summarize(Proportion = n() / nrow(for_table))

overall$Proportion = round(overall$Proportion, 3)
extended_table <- rbind(rounded_table, overall$Proportion)

kable(extended_table, caption = "Proportion of Key Values Within Each Cluster")
Proportion of Key Values Within Each Cluster
Key = 0 Key = 1 Key = 2 Key = 3 Key = 4 Key = 5 Key = 6 Key = 7 Key = 8 Key = 9 Key = 10 Key = 11
Cluster = 1 0.116 0.096 0.103 0.036 0.069 0.091 0.024 0.132 0.078 0.104 0.074 0.077
Cluster = 2 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Cluster = 3 0.378 0.000 0.336 0.000 0.000 0.000 0.000 0.000 0.286 0.000 0.000 0.000
Cluster = 4 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000
Cluster = 5 0.112 0.122 0.097 0.040 0.068 0.078 0.064 0.097 0.076 0.085 0.084 0.078
Cluster = 6 0.244 0.092 0.155 0.099 0.042 0.070 0.000 0.106 0.074 0.074 0.000 0.043
Cluster = 7 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Cluster = 8 0.138 0.092 0.123 0.067 0.080 0.095 0.000 0.115 0.064 0.095 0.075 0.058
Cluster = 9 0.024 0.000 0.006 0.001 0.000 0.000 0.967 0.001 0.001 0.000 0.000 0.000
Cluster = 10 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000
Cluster = 11 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000
Cluster = 12 0.287 0.092 0.188 0.038 0.027 0.037 0.017 0.105 0.077 0.056 0.013 0.063
Cluster = 13 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000
Cluster = 14 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000
Cluster = 15 0.000 0.000 0.000 0.218 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.782
Cluster = 16 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000
Cluster = 17 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000
0.115 0.117 0.095 0.033 0.069 0.083 0.068 0.107 0.071 0.092 0.071 0.079
Code
two_way_table <- table(for_table$cluster, for_table$mode)
normalized_table <- prop.table(two_way_table, margin=1)
rounded_table <- round(normalized_table, 3)

colnames(rounded_table) <- paste("Mode =", colnames(normalized_table))
rownames(rounded_table) <- paste("Cluster =", rownames(normalized_table))

overall <- for_table %>%
  group_by(mode) %>%
  summarize(Proportion = n() / nrow(for_table))

overall$Proportion <- round(overall$Proportion, 3)
extended_table <- rbind(rounded_table, Overall = overall$Proportion)

kable(extended_table, caption = "Proportion of Mode Values Within Each Cluster")
Proportion of Mode Values Within Each Cluster
Mode = 0 Mode = 1
Cluster = 1 0.296 0.704
Cluster = 2 0.289 0.711
Cluster = 3 0.251 0.749
Cluster = 4 0.268 0.732
Cluster = 5 0.383 0.617
Cluster = 6 0.311 0.689
Cluster = 7 0.603 0.397
Cluster = 8 0.326 0.674
Cluster = 9 0.624 0.376
Cluster = 10 0.529 0.471
Cluster = 11 0.397 0.603
Cluster = 12 0.295 0.705
Cluster = 13 0.626 0.374
Cluster = 14 0.409 0.591
Cluster = 15 0.600 0.400
Cluster = 16 0.476 0.524
Cluster = 17 0.513 0.487
Overall 0.392 0.608
Code
two_way_table <- table(for_table$cluster, for_table$time_signature)
normalized_table <- prop.table(two_way_table, margin=1)
rounded_table <- round(normalized_table, 3)

colnames(rounded_table) <- paste("Time Signature =", colnames(normalized_table))
rownames(rounded_table) <- paste("Cluster =", rownames(normalized_table))

overall <- for_table %>%
  group_by(time_signature) %>%
  summarize(Proportion = n() / nrow(for_table))

overall$Proportion <- round(overall$Proportion, 3)
extended_table <- rbind(rounded_table, Overall = overall$Proportion)

kable(extended_table, caption = "Proportion of Time Signature Values Within Each Cluster")
Proportion of Time Signature Values Within Each Cluster
Time Signature = 0 Time Signature = 1 Time Signature = 3 Time Signature = 4 Time Signature = 5
Cluster = 1 0.000 0.000 1.000 0.000 0.000
Cluster = 2 0.000 0.006 0.000 0.994 0.000
Cluster = 3 0.000 0.005 0.000 0.995 0.000
Cluster = 4 0.000 0.009 0.000 0.991 0.000
Cluster = 5 0.000 0.000 0.000 0.000 1.000
Cluster = 6 0.017 0.072 0.000 0.911 0.000
Cluster = 7 0.000 0.010 0.001 0.989 0.000
Cluster = 8 0.000 0.000 1.000 0.000 0.000
Cluster = 9 0.000 0.019 0.051 0.931 0.000
Cluster = 10 0.000 0.008 0.000 0.991 0.000
Cluster = 11 0.014 0.053 0.313 0.620 0.000
Cluster = 12 0.017 0.043 0.029 0.912 0.000
Cluster = 13 0.000 0.005 0.001 0.994 0.000
Cluster = 14 0.016 0.050 0.027 0.907 0.000
Cluster = 15 0.000 0.007 0.000 0.993 0.000
Cluster = 16 0.000 0.007 0.000 0.993 0.000
Cluster = 17 0.000 0.001 0.003 0.997 0.000
Overall 0.002 0.013 0.097 0.862 0.025

The following graphs show the standardized scores for each numerical feature across all clusters, with the scores for the cluster being examined highlighted. This allows us to identify which song characteristics stand out for the songs within each cluster.
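For reference, the `long` data frame used in these plots holds one row per cluster-feature pair, with the value being the standardized cluster center. A minimal sketch of how such a structure can be built from a fitted k-means object, using toy data and the hypothetical name `long_sketch` so as not to overwrite the objects constructed earlier in this report:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the scaled song-feature matrix used in clustering
set.seed(27)
scaled_features <- scale(matrix(rnorm(300), ncol = 3,
                                dimnames = list(NULL, c("tempo", "valence", "energy"))))

km <- kmeans(scaled_features, centers = 4, nstart = 10)

# One row per (cluster, feature) pair; value is the standardized center
long_sketch <- as.data.frame(km$centers) %>%
  mutate(cluster = row_number()) %>%
  pivot_longer(cols = -cluster, names_to = "feature", values_to = "value")
```

With 4 clusters and 3 features this yields 12 rows, which is exactly the shape `ggplot` needs to draw one point per cluster for each feature.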

Cluster 1: Fast Pace, Key 0

Using the graph below, this cluster is defined by high levels of valence and tempo and low levels of acousticness and instrumentalness. This cluster also has around 50% of its songs in key 0, while only about 10% of all songs are in key 0. So, songs in this cluster are likely to be faster paced with little acoustic content (more synthesizers used).

Code
highlight_cluster <- 1

ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
  geom_point(size = 1.5, alpha = 0.5) +
  geom_point(data = filter(long, cluster == highlight_cluster), 
             aes(x = feature, y = value), 
             color = "black", size = 1.5) +
  labs(title = "Visualising Cluster Makeup",
       subtitle = "Cluster 1",
       y = "Cluster Standardized Center") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1),
        plot.title = element_text(hjust = 0.5),    # center the title
        plot.subtitle = element_text(hjust = 0.5), # center the subtitle
        plot.margin = margin(20, 20, 20, 20)) +
  scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
  guides(color = "none")

Cluster 2: High Tempo, Loudness, and Energy, Key 2, Major Mode

This cluster has very high values for loudness, tempo, and energy, and very low values for acousticness and instrumentalness. It has a high proportion of major-mode songs compared to the entire dataset, and all of its songs are in key 2.

Code
highlight_cluster <- 2

ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
  geom_point(size = 1.5, alpha = 0.5) +
  geom_point(data = filter(long, cluster == highlight_cluster), 
             aes(x = feature, y = value), 
             color = "black", size = 1.5) +
  labs(title = "Visualising Cluster Makeup",
       subtitle = "Cluster 2",
       y = "Cluster Standardized Center") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1),
        plot.title = element_text(hjust = 0.5),    # center the title
        plot.subtitle = element_text(hjust = 0.5), # center the subtitle
        plot.margin = margin(20, 20, 20, 20)) +
  scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
  guides(color = "none")

Cluster 3: Low Tempo and Valence, Fairly High Acousticness, Key 10, Time Signature 3

This cluster is the opposite of the previous ones: it has higher values for acousticness and instrumentalness and lower values for the features having to do with the loudness and pace of the songs. It has high proportions of time-signature-3 songs and key-10 songs.

Code
highlight_cluster <- 3

ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
  geom_point(size = 1.5, alpha = 0.5) +
  geom_point(data = filter(long, cluster == highlight_cluster), 
             aes(x = feature, y = value), 
             color = "black", size = 1.5) +
  labs(title = "Visualising Cluster Makeup",
       subtitle = "Cluster 3",
       y = "Cluster Standardized Center") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1),
        plot.title = element_text(hjust = 0.5),    # center the title
        plot.subtitle = element_text(hjust = 0.5), # center the subtitle
        plot.margin = margin(20, 20, 20, 20)) +
  scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
  guides(color = "none")

Cluster 4: Slow Tempo, High Acousticness, Key 2

Same as cluster 3, but now all of the songs are in key 2 instead of key 10.

Code
highlight_cluster <- 4

ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
  geom_point(size = 1.5, alpha = 0.5) +
  geom_point(data = filter(long, cluster == highlight_cluster), 
             aes(x = feature, y = value), 
             color = "black", size = 1.5) +
  labs(title = "Visualising Cluster Makeup",
       subtitle = "Cluster 4",
       y = "Cluster Standardized Center") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1),
        plot.title = element_text(hjust = 0.5),    # center the title
        plot.subtitle = element_text(hjust = 0.5), # center the subtitle
        plot.margin = margin(20, 20, 20, 20)) +
  scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
  guides(color = "none")

Cluster 5: Fairly Balanced, Moderate Pace, Key 7, Major Mode

This group of songs has fairly average values for the numerical features, though it is definitely on the less acoustic side with a moderate pace. All songs are in key 7, and there is a high proportion of major-mode songs compared to the rest of the dataset.

Code
highlight_cluster <- 5

ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
  geom_point(size = 1.5, alpha = 0.5) +
  geom_point(data = filter(long, cluster == highlight_cluster), 
             aes(x = feature, y = value), 
             color = "black", size = 1.5) +
  labs(title = "Visualising Cluster Makeup",
       subtitle = "Cluster 5",
       y = "Cluster Standardized Center") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1),
        plot.title = element_text(hjust = 0.5),    # center the title
        plot.subtitle = element_text(hjust = 0.5), # center the subtitle
        plot.margin = margin(20, 20, 20, 20)) +
  scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
  guides(color = "none")

Cluster 6: Balanced, Major Mode, Time Signature 3, Key 3

This cluster is even more balanced than the previous cluster, with no extreme values for the numerical features. Almost all of the songs are in time signature 3, which is rare. All of the songs are in key 3, and a very high proportion are in the major mode.

Code
highlight_cluster <- 6

ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
  geom_point(size = 1.5, alpha = 0.5) +
  geom_point(data = filter(long, cluster == highlight_cluster), 
             aes(x = feature, y = value), 
             color = "black", size = 1.5) +
  labs(title = "Visualising Cluster Makeup",
       subtitle = "Cluster 6",
       y = "Cluster Standardized Center") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1),
        plot.title = element_text(hjust = 0.5),    # center the title
        plot.subtitle = element_text(hjust = 0.5), # center the subtitle
        plot.margin = margin(20, 20, 20, 20)) +
  scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
  guides(color = "none")

Cluster 7: Extreme Acousticness and Instrumentalness, Slow Pace, Balanced Keys

This cluster is defined by extremes. It has the highest values for acousticness and instrumentalness and the lowest values for tempo, valence, danceability, etc., with a very balanced distribution of keys. Songs in this group are likely to have no words, possibly classical music.

Code
highlight_cluster <- 7

ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
  geom_point(size = 1.5, alpha = 0.5) +
  geom_point(data = filter(long, cluster == highlight_cluster), 
             aes(x = feature, y = value), 
             color = "black", size = 1.5) +
  labs(title = "Visualising Cluster Makeup",
       subtitle = "Cluster 7",
       y = "Cluster Standardized Center") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1),
        plot.title = element_text(hjust = 0.5),    # center the title
        plot.subtitle = element_text(hjust = 0.5), # center the subtitle
        plot.margin = margin(20, 20, 20, 20)) +
  scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
  guides(color = "none")

Cluster 8: Balanced, Time Signature 4, Key 3, Minor Mode

Again, we have a very balanced cluster in terms of the numerical features. All of the songs are in time signature 4 and key 3, and a high proportion are in the minor mode.

Code
highlight_cluster <- 8

ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
  geom_point(size = 1.5, alpha = 0.5) +
  geom_point(data = filter(long, cluster == highlight_cluster), 
             aes(x = feature, y = value), 
             color = "black", size = 1.5) +
  labs(title = "Visualising Cluster Makeup",
       subtitle = "Cluster 8",
       y = "Cluster Standardized Center") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1),
        plot.title = element_text(hjust = 0.5),    # center the title
        plot.subtitle = element_text(hjust = 0.5), # center the subtitle
        plot.margin = margin(20, 20, 20, 20)) +
  scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
  guides(color = "none")

Cluster 9: Fast Paced, Not a Lot of Instruments, Key 4 and 11, Minor Mode

This cluster returns to the earlier trend: fast paced with low instrumentalness/acousticness. Half of the songs are in key 4 and half in key 11, and a very high proportion are in the minor mode.

Code
highlight_cluster <- 9

ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
  geom_point(size = 1.5, alpha = 0.5) +
  geom_point(data = filter(long, cluster == highlight_cluster), 
             aes(x = feature, y = value), 
             color = "black", size = 1.5) +
  labs(title = "Visualising Cluster Makeup",
       subtitle = "Cluster 9",
       y = "Cluster Standardized Center") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1),
        plot.title = element_text(hjust = 0.5),    # center the title
        plot.subtitle = element_text(hjust = 0.5), # center the subtitle
        plot.margin = margin(20, 20, 20, 20)) +
  scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
  guides(color = "none")

Cluster 10: Very Fast Paced and Danceable, Minor Mode, Key 10

Very high danceability, valence, and tempo; low instrumentalness and acousticness. A very high proportion of the songs are in the minor mode, and all are in key 10.

Code
highlight_cluster <- 10

ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
  geom_point(size = 1.5, alpha = 0.5) +
  geom_point(data = filter(long, cluster == highlight_cluster), 
             aes(x = feature, y = value), 
             color = "black", size = 1.5) +
  labs(title = "Visualising Cluster Makeup",
       subtitle = "Cluster 10",
       y = "Cluster Standardized Center") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1),
        plot.title = element_text(hjust = 0.5),    # center the title
        plot.subtitle = element_text(hjust = 0.5), # center the subtitle
        plot.margin = margin(20, 20, 20, 20)) +
  scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
  guides(color = "none")

Cluster 11: Fairly High Pace, Key 8

This cluster also follows the earlier trend: fairly fast paced with low instrumentalness/acousticness. All songs are in key 8.

Code
highlight_cluster <- 11

ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
  geom_point(size = 1.5, alpha = 0.5) +
  geom_point(data = filter(long, cluster == highlight_cluster), 
             aes(x = feature, y = value), 
             color = "black", size = 1.5) +
  labs(title = "Visualising Cluster Makeup",
       subtitle = "Cluster 11",
       y = "Cluster Standardized Center") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1),
        plot.title = element_text(hjust = 0.5),    # center the title
        plot.subtitle = element_text(hjust = 0.5), # center the subtitle
        plot.margin = margin(20, 20, 20, 20)) +
  scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
  guides(color = "none")

Cluster 12: Fairly High Tempo, Key 6, Minor Mode

Very similar to cluster 11, but slightly less extreme in all numerical features. All songs are in key 6, and the songs tend to be in the minor mode.

Code
highlight_cluster <- 12

ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
  geom_point(size = 1.5, alpha = 0.5) +
  geom_point(data = filter(long, cluster == highlight_cluster), 
             aes(x = feature, y = value), 
             color = "black", size = 1.5) +
  labs(title = "Visualising Cluster Makeup",
       subtitle = "Cluster 12",
       y = "Cluster Standardized Center") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1),
        plot.title = element_text(hjust = 0.5),    # center the title
        plot.subtitle = element_text(hjust = 0.5), # center the subtitle
        plot.margin = margin(20, 20, 20, 20)) +
  scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
  guides(color = "none")

Cluster 13: Very High Speechiness, Key 1

Very high danceability, tempo, and speechiness, with very low acousticness. All of the songs are in key 1.

Code
highlight_cluster <- 13

ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
  geom_point(size = 1.5, alpha = 0.5) +
  geom_point(data = filter(long, cluster == highlight_cluster), 
             aes(x = feature, y = value), 
             color = "black", size = 1.5) +
  labs(title = "Visualising Cluster Makeup",
       subtitle = "Cluster 13",
       y = "Cluster Standardized Center") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1),
        plot.title = element_text(hjust = 0.5),    # center the title
        plot.subtitle = element_text(hjust = 0.5), # center the subtitle
        plot.margin = margin(20, 20, 20, 20)) +
  scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
  guides(color = "none")

Cluster 14: Relatively Fast Paced, Mixed Time Signatures, Minor Mode

Back to the earlier trend: faster-paced songs with low instrumentalness. All songs are in key 5, and a very high proportion are in the minor mode compared to the rest of the dataset.

Code
highlight_cluster <- 14

ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
  geom_point(size = 1.5, alpha = 0.5) +
  geom_point(data = filter(long, cluster == highlight_cluster), 
             aes(x = feature, y = value), 
             color = "black", size = 1.5) +
  labs(title = "Visualising Cluster Makeup",
       subtitle = "Cluster 14",
       y = "Cluster Standardized Center") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1),
        plot.title = element_text(hjust = 0.5),    # center the title
        plot.subtitle = element_text(hjust = 0.5), # center the subtitle
        plot.margin = margin(20, 20, 20, 20)) +
  scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
  guides(color = "none")

Cluster 15: Balanced, Time Signature 3

Middle of the road in almost every numerical category. All songs are in time signature 3, which is very rare, and the distribution of keys is very dispersed.

Code
highlight_cluster <- 15

ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
  geom_point(size = 1.5, alpha = 0.5) +
  geom_point(data = filter(long, cluster == highlight_cluster), 
             aes(x = feature, y = value), 
             color = "black", size = 1.5) +
  labs(title = "Visualising Cluster Makeup",
       subtitle = "Cluster 15",
       y = "Cluster Standardized Center") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1),
        plot.title = element_text(hjust = 0.5),    # center the title
        plot.subtitle = element_text(hjust = 0.5), # center the subtitle
        plot.margin = margin(20, 20, 20, 20)) +
  scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
  guides(color = "none")

Cluster 16: Extreme Speechiness and Liveness, Time Signature 5

Extreme values of speechiness and liveness, which suggests these tracks may be spoken-word pieces or even podcasts. All songs are in time signature 5.

Code
highlight_cluster <- 16

ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
  geom_point(size = 1.5, alpha = 0.5) +
  geom_point(data = filter(long, cluster == highlight_cluster), 
             aes(x = feature, y = value), 
             color = "black", size = 1.5) +
  labs(title = "Visualising Cluster Makeup",
       subtitle = "Cluster 16",
       y = "Cluster Standardized Center") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1),
        plot.title = element_text(hjust = 0.5),    # center the title
        plot.subtitle = element_text(hjust = 0.5), # center the subtitle
        plot.margin = margin(20, 20, 20, 20)) +
  scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
  guides(color = "none")

Cluster 17: Slow Pace, High Instrumentalness, Key 5

This cluster goes back to the slower-paced, high-instrumentalness/acousticness profile. All songs are in key 5.

Code
highlight_cluster <- 17

ggplot(long, aes(x = feature, y = value, color = factor(cluster))) +
  geom_point(size = 1.5, alpha = 0.5) +
  geom_point(data = filter(long, cluster == highlight_cluster), 
             aes(x = feature, y = value), 
             color = "black", size = 1.5) +
  labs(title = "Visualising Cluster Makeup",
       subtitle = "Cluster 17",
       y = "Cluster Standardized Center") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1),
        plot.title = element_text(hjust = 0.5),    # center the title
        plot.subtitle = element_text(hjust = 0.5), # center the subtitle
        plot.margin = margin(20, 20, 20, 20)) +
  scale_color_manual(values = rep("grey", n_distinct(long$cluster))) +
  guides(color = "none")

Secondary Goal (Genre Prediction):

Now we will use a secondary dataset that contains all the same features as the clustering dataset, this time with different songs. This dataset also includes the track genre as a column; this is the variable we are trying to predict. Below, some data pre-processing is done: converting the categorical variables to factors and dropping columns that are unique to each song, as they won't help in predicting the genre. It is worth noting that the original dataset contained 114 different genres; we regrouped these into 13 more general genres. Some of the genres were very niche, and many people using the shiny app would not recognize some of the output genres, so the regrouping is for ease of use.

Code
genre <- genre %>% 
  mutate(
    across(.cols = c(key, mode, time_signature, track_genre), 
           .fns = ~ as.factor(.x)
           )
  ) %>% 
  select(-`Unnamed: 0`, -track_id, -artists, -album_name, -track_name, -popularity, -duration_ms, -explicit)
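The regrouping of the 114 original genres into 13 broader groups can be sketched with `forcats::fct_collapse`; the group names and memberships below are illustrative placeholders on toy data, not the exact mapping used in the report:

```r
library(dplyr)
library(forcats)

# Toy data standing in for the genre dataset; the real column is track_genre
genre_toy <- tibble(track_genre = factor(c("acoustic", "black-metal", "deep-house",
                                           "death-metal", "house", "folk")))

genre_toy <- genre_toy %>%
  mutate(track_genre = fct_collapse(
    track_genre,
    # Hypothetical groupings -- the report's actual 13 groups differ
    metal      = c("black-metal", "death-metal"),
    electronic = c("deep-house", "house"),
    folk       = c("acoustic", "folk")
  ))

levels(genre_toy$track_genre)  # collapsed to 3 broader genres in this toy example
```

Any original genre not listed in a group is left unchanged, so the full mapping must cover every niche genre explicitly (or use `other_level` as a catch-all).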

The model specification we are going with is a random forest. This is a good choice for our data because we have a fairly large number of predictors and are not entirely sure which will be most important in predicting genre. Random forests use a different random subset of predictors for each tree, so we do not have to choose which predictors to include in the model despite our lack of context on these predictors.

Below we take a random stratified sample of the entire dataset to perform hyperparameter tuning. This is needed because the dataset contains a large number of songs (114,000), and we lack the computational power to perform tuning on a dataset that large. This reduction allows us to tune the model while still giving it a large sample. Once the hyperparameters are chosen, the model will be fit on the full training data.

Code
set.seed(27)

sampled_genre <- genre %>%
  group_by(track_genre) %>%
  sample_n(size = 1000, replace = FALSE) %>%
  ungroup()

genre_cvs <- vfold_cv(sampled_genre, v = 3)

genre_recipe <- recipe(track_genre ~ ., data = sampled_genre)
Code
set.seed(27)

rf_spec <- rand_forest(mtry = tune(), 
                       min_n = tune(), 
                       trees = 10) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_grid <- grid_regular(mtry(c(3, 12)),
                        min_n(),
                        levels = 3
                        )  

rf_wflow <- workflow() %>%
  add_model(rf_spec) %>% 
  add_recipe(genre_recipe)

rf_grid_search <-
  tune_grid(
    rf_wflow,
    resamples = genre_cvs,
    grid = rf_grid
  )

tuning_metrics <- rf_grid_search %>%
  collect_metrics() %>%
  filter(.metric == "accuracy")

kable(tuning_metrics %>%
  slice_max(mean) %>% 
  select(mtry, min_n, .metric, mean),
  col.names=c("Mtry", "Min n", "Metric", "Value"),
  align=c("c", "c", "c", "c")
  )
Mtry Min n Metric Value
3 21 accuracy 0.355231
Code
kable(tuning_metrics %>% 
      select(mtry, min_n, .metric, mean),
      col.names=c("Mtry", "Min n", "Metric", "Value"),
      align=c("c", "c", "c", "c")
  )
Mtry Min n Metric Value
3 2 accuracy 0.3372311
7 2 accuracy 0.3364620
12 2 accuracy 0.3404621
3 21 accuracy 0.3552310
7 21 accuracy 0.3488468
12 21 accuracy 0.3453084
3 40 accuracy 0.3433080
7 40 accuracy 0.3507696
12 40 accuracy 0.3474619

We see that the model with the highest accuracy uses an mtry of 3 and a min_n of 21. This also happens to be the second-simplest model, so it is an easy choice to move forward with. We then fit the model with these hyperparameters to the whole dataset.

Code
full_genre_recipe <- recipe(track_genre ~ ., data = genre)

rf_spec <- rand_forest(mtry = 3, 
                       min_n = 21, 
                       trees = 10) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_wflow <- workflow() %>%
  add_model(rf_spec) %>% 
  add_recipe(full_genre_recipe)

rf_fit <- rf_wflow %>%
  fit(genre)
Code
# detaching the caret package because it leads to a namespace error in the code below
detach("package:caret", unload = TRUE)

full_genre_cvs <- vfold_cv(genre, v = 5, strata=track_genre)

# fit_resamples refits the workflow on each fold, so we resample the
# untrained workflow; cv_metrics avoids shadowing base::table
cv_metrics <- rf_wflow %>%
  fit_resamples(full_genre_cvs,
                metrics = metric_set(accuracy)) %>%
  collect_metrics()

kable(cv_metrics %>% 
      select(.metric, mean),
      col.names=c("Metric", "Value"),
      align=c("c", "c")
      )
Metric Value
accuracy 0.4758771

From the results we see that the accuracy is fairly good: a cross-validated accuracy of approximately 0.476. This means that when predicting song genres with our model, we should expect it to predict the correct genre roughly 48% of the time.
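To put that 0.476 figure in context, compare it with the naive baseline of guessing uniformly at random among the 13 regrouped genres (this comparison ignores class imbalance; a majority-class baseline could be somewhat higher):

```r
# Naive baseline: guessing uniformly at random among the 13 regrouped genres
random_baseline <- 1 / 13          # about 0.077

# Cross-validated accuracy of the tuned random forest (from the table above)
rf_accuracy <- 0.476

rf_accuracy / random_baseline      # roughly a six-fold improvement over guessing
```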

Ethical Implications / Limitations

Regarding the ethics of this analysis, one big issue is the scope of the dataset used for the recommendation algorithm. Not all artists are contained in the dataset, so artists absent from it will never be recommended to users. The dataset contains a random selection of 130,000 songs that were present on Spotify as of April 2019, which means there are no songs from May 2019 onward. This exclusion of certain artists is a problem. Since the songs were collected randomly, artists with more songs on Spotify were more likely to be included in the dataset, so not all artists are given a fair shot. Smaller artists will likely have fewer songs in the dataset and therefore a lower chance of being recommended, while more popular artists will likely have more songs and be recommended more often. So, this algorithm in its current state may reinforce the status quo of the music industry by keeping popular artists popular and barring less popular artists from breaking through.

The other main issue is the timeframe of the songs used in the analysis. It is currently 2023, and the dataset does not contain any songs from the past four years, so our algorithm will not recommend any of these newer songs. This is a fairly big issue, as users will likely want some contemporary music in their recommended playlists, and the analysis in its current state does not provide this.